
Anomaly Detection in Financial Institutions Using Isolation Forest
Introduction
Anomaly detection is the identification of data points that are unusual relative to the rest of a dataset. It plays a critical role in identifying fraudulent activities, in the context of this project credit card transactions, where the goal is to detect deviating patterns that may indicate fraudulent behavior. One machine learning algorithm used for anomaly detection is Isolation Forest. Unlike traditional methods that rely on labeled data, Isolation Forest is an unsupervised technique designed to detect outliers (anomalies) by isolating data points that differ significantly from the majority. The algorithm works by constructing multiple decision trees to separate instances in the dataset, and since anomalies are rare and distinct, they are more likely to be isolated quickly. In the context of credit card transactions, the target is to identify fraudulent transactions that deviate from normal spending patterns, such as unusual transaction amounts, locations, or times. Isolation Forest identifies these outliers efficiently, making it an ideal tool for detecting potential fraud in real time without requiring extensive labeled data, enabling quicker responses to suspicious activities and enhancing security in financial systems.
Intuition
Anomaly detection (also called “outlier detection” or “novelty detection”) is the process of identifying data points that significantly differ from the rest of the observations. The “anomalies” are instances that do not conform to the expected pattern or behavior of the majority of the data. Depending on the specific method used, the underlying assumption is that the majority of data points represent typical or normal behavior, while anomalies represent rare or unusual events that deviate from the norm. For example, in a card fraud detection system, most of the transactions are legitimate, but fraudulent transactions are anomalies that deviate from the usual patterns of the customer or a group.
In many anomaly detection methods, an anomaly (abnormal) data point is distant or significantly different from the rest of the data points. This could be in terms of distance or density; some data points are far away from the majority in feature space or they are in areas with very low data density (i.e., sparse regions in the data space).
Anomalies are often linked to rare events that do not conform to the usual patterns. In card fraud detection, most financial transactions are similar to the rest, but a high-value withdrawal from an unusual location could be an anomaly suggesting fraudulent activity. In addition, they have a low probability of occurring according to a certain distribution.
The type of anomaly detection covered in this paper is “Isolation Forest”; a tree-based algorithm which works by recursively partitioning data and isolating points that are different from the rest. Isolation Forest is particularly useful when dealing with large, high-dimensional data sets because it does not rely on distance or density-based approaches, which are often computationally expensive for large data sets.
The core idea of Isolation Forest is based on the assumption that anomalies (outliers) are few and different compared to the majority of the data (normal points), which makes it easier to isolate them with fewer splits. In contrast, normal data points are often more similar to other points and require more splits to isolate. This difference in how many splits are needed to isolate a data point is what the algorithm uses to distinguish anomalies from normal points. The process is similar to building a decision tree, where data points are split into two subsets based on a random feature and a random split value. Since anomalous points usually lie far from the bulk of the data, they tend to fall into small regions of the feature space quickly; hence, fewer splits are required to isolate these outliers.
The Isolation Forest algorithm works by creating a series of random binary trees (called isolation trees) in which data points are recursively split at random feature values. Each tree is created by randomly selecting features and randomly choosing split points along those features. A short path length (i.e., few splits) in a tree suggests that the point is anomalous, while a longer path length (more splits) suggests that the point is normal. The path length is the number of splits required to isolate a point from the rest of the data in the tree.
For each data point, an anomaly score is computed by averaging its path lengths across all trees in the forest. A short average path length indicates that the point is anomalous, while a longer average path length indicates that the point is more likely to be normal. This comes from the idea that points with shorter average path lengths are more easily isolated and therefore receive higher anomaly scores. The score ranges from 0 to 1, where values closer to 1 indicate anomalies and values closer to 0 indicate normal points.
Some of the advantages of Isolation Forest are scalability, good performance on high-dimensional data, and a low need for parameter tuning. Isolation Forest is efficient for large data sets because it uses a small number of trees (compared to methods that require intensive calculations, like k-nearest neighbors or clustering). The algorithm also does not rely on calculating distances or densities, which can become computationally expensive in high-dimensional spaces. Lastly, the algorithm is relatively simple and does not require much parameter tuning: the number of trees and the depth of the trees are the primary hyperparameters, and their values do not need to be finely tuned.
The Isolation Forest algorithm is an efficient and effective anomaly detection technique, especially suited for large and high-dimensional data. Its intuition is based on the idea that anomalies are easy to isolate because they differ significantly from the normal data, which makes them stand out when partitioning the data. Calculating the path length required to isolate each data point allows for quick identification of anomalies within the data.
Fundamentals
How anomaly detection through Isolation Forest works:
To isolate a data point, a binary tree structure called an isolation tree (iTree) is built: the data are repeatedly split on randomly chosen features at random split values between the minimum and maximum values of the chosen feature. This generates, for each point, a path length in the tree from the root to a terminating (external) node.
For example, let \(\mathbf{X} = \{ x_1, \dots, x_n \}\) be a set of \(d\)-dimensional points and \(X' \subset \mathbf{X}\) a subset of it.
An iTree, which is a data structure, has the following properties:
Each node \(T\) in the iTree is either an external node with no children or an internal node with one “test” and exactly two child nodes, \(T_l\) and \(T_r\).
A test at node \(T\) consists of a feature \(q\) and a split value \(p\) such that the test \(q < p\) determines whether a point traverses to \(T_l\) or to \(T_r\).
The algorithm recursively divides \(X'\) through random selection of a feature \(q\) and a split value \(p\), until either:
The node has only one observation,
All the data at the node have the same values, or
The tree reaches a predefined height limit.
Once the iTree is fully grown, each observation in \(\mathbf{X}\) is isolated at one of the external nodes. The path length \(h(x_i)\) for a point \(x_i \in \mathbf{X}\) is defined as the number of edges \(x_i\) traverses from the root of the tree to an external node.
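As an illustration of the recursive splitting and path-length computation described above, here is a minimal toy sketch in Python (function names and the height limit of 10 are made up for illustration; this is not the implementation used later in this paper):

```python
import random

def build_itree(points, depth=0, max_depth=10):
    """Recursively build an isolation tree over a list of feature vectors."""
    # Stop when the node holds at most one point or the height limit is reached.
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    q = random.randrange(len(points[0]))   # randomly chosen feature
    lo = min(p[q] for p in points)
    hi = max(p[q] for p in points)
    if lo == hi:                           # all values identical on feature q
        return {"size": len(points)}
    split = random.uniform(lo, hi)         # random split value in [min, max]
    return {
        "q": q, "p": split,
        "left": build_itree([p for p in points if p[q] < split], depth + 1, max_depth),
        "right": build_itree([p for p in points if p[q] >= split], depth + 1, max_depth),
    }

def path_length(x, node, depth=0):
    """Number of edges traversed from the root until x reaches an external node."""
    if "size" in node:
        return depth
    branch = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, branch, depth + 1)
```

Averaged over many such trees, a point far from the bulk of the data reaches an external node after far fewer splits than a typical point, which is exactly the signal the algorithm exploits.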
Anomaly detection with Isolation Forest proceeds as follows:
Use the training dataset to build a number of iTrees.
For each data point in the test set:
Pass it through all the iTrees, counting the path length for each tree.
Assign an anomaly score to the point.
Label the point as an anomaly if its score is greater than a predefined threshold, which depends on the domain.
Anomaly Score
The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is similar to Binary Search Trees (BST): a termination at an external node of the iTree corresponds to an unsuccessful search in the BST. Therefore, the estimation of the average \(h(x)\) for external node terminations is equivalent to that of unsuccessful searches in a BST.
Specifically, the average path length \(c(m)\) for a sample size \(m\) is given by:
\[ c(m) = \begin{cases} 2 H(m-1) - \frac{2(m-1)}{m} & \text{for } m > 2 \\ 1 & \text{for } m = 2 \\ 0 & \text{otherwise} \end{cases} \]
Where:
\(m\) is the sample size
\(H(i) = \ln(i) + \gamma\), with \(\gamma \approx 0.5772156649\) being the Euler–Mascheroni constant.
In this context, \(c(m)\) represents the average path length \(h(x)\) for a given sample size \(m\), and it is used to normalize \(h(x)\) to estimate the anomaly score for a given observation \(x\).
The anomaly score \(s(x, m)\) is then computed as:
\[ s(x, m) = 2^{-\frac{E(h(x))}{c(m)}} \]
Where: \(E(h(x))\) is the average path length \(h(x)\) from a collection of iTrees.
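The two formulas above translate directly into code; here is a short Python sketch (the symbols follow the equations above, and the input average path length would come from a forest of iTrees):

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(m):
    """Average path length of an unsuccessful BST search for sample size m."""
    if m > 2:
        # H(m-1) is approximated as ln(m-1) + gamma.
        return 2 * (math.log(m - 1) + EULER_GAMMA) - 2 * (m - 1) / m
    return 1.0 if m == 2 else 0.0

def anomaly_score(avg_path_length, m):
    """s(x, m) = 2^(-E(h(x)) / c(m))."""
    return 2 ** (-avg_path_length / m_norm) if (m_norm := c(m)) else 1.0
```

Note that when the average path length \(E(h(x))\) equals \(c(m)\), the score is exactly 0.5; shorter paths push the score toward 1 and longer paths push it toward 0, matching the interpretation below.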
Interpreting the Anomaly Score:
If \(s(x)\) is close to 1, the point \(x\) is very likely an anomaly.
If \(s(x)\) is much smaller than 0.5, the point \(x\) is likely normal.
If all points in the sample have scores around 0.5, it is likely that the sample contains no distinct anomalies.
This process allows Isolation Forest to efficiently detect anomalies based on the ease with which a point can be isolated in the feature space.
Application of Isolation Forest for Anomaly Detection of Credit Card Transactions
The dataset comes from Kaggle:
Description of the variables of the dataset:
Transaction Details:
trans_date, trans_time – Timestamp of the transaction
cc_num – Unique (anonymized) credit card number
merchant – Merchant where the transaction occurred
category – Type of transaction (e.g., travel, food, personal care)
amt – Transaction amount
Cardholder Details:
first, last – First and last name of the cardholder
gender – Gender of the cardholder
street, city, state, zip – Address of the cardholder
lat, long – Geographical location of the cardholder
city_pop – Population of the cardholder’s city
job – Profession of the cardholder
dob – Date of birth of the cardholder
trans_num – Unique transaction identifier
unix_time – Timestamp of transaction in Unix format
Merchant Details:
merch_lat, merch_long – Merchant’s location (latitude & longitude)
Fraud Indicator:
is_fraud – Target variable (1 = Fraud, 0 = Legitimate)
Visualizing the data
The correlation matrix below allows us to see the relationships between pairs of variables in the dataset. When two variables have a high correlation (close to 1 or -1), they move together, and we can remove (or combine) one of them to avoid multicollinearity. In the case of building an Isolation Forest, the matrix shows which features can be removed without losing significant information, allowing for a simpler model. Removing them also reduces dimensionality while improving the model’s performance.
We eliminate some of the unnecessary variables that have high correlation, such as long, merch_long, merch_lat, and lat, as well as variables that tend to be unique to individuals, such as last, street, merchant, city, and trans_time (since we also have the Unix time, which accounts for the date of the transaction, not just the time).
Distribution of the Variables
The following graphs allow us to picture the distributions of the variables that we will use for our Isolation Forest and to see their major patterns.

The picture above shows the distributions of the numeric variables that will be used. The transaction time is measured in seconds (there are 86,400 seconds in 24 hours) and appears to be roughly uniform, with no strong peak times. amt, the amount of the transaction in dollars, is heavily skewed to the right, indicating that the majority of transactions are small amounts; the long tail reflects typical consumer behavior, with most credit card purchases being low to moderate and only a few large ones. The distribution across zip codes is fairly uniform; the peaks at specific codes could reflect areas with higher transaction activity or more credit card holders. city_pop, the city population, is skewed to the right, indicating that most transactions occur in cities with small populations. The Unix transaction time has grouped peaks, indicating that some periods have higher transaction frequencies, such as bursts of activity or time-specific trends (holidays, weekends, etc.).
Distribution of all the categorical variables

In the set of categorical variables, the (transaction) category distribution looks relatively even, but grocery_net, shopping_pos, gas_transport, and shopping_net seem more popular, indicating that the majority of transactions are related to everyday purchases. The distribution for gender is almost perfectly balanced between the male and female categories. For the state distribution, CA, TX, NY, and FL have much higher counts of transactions, likely reflecting population size or sampling concentration in the dataset.
The data frame below shows the proportion (prop_train) and the number (count_train) of transactions that are legitimate and fraudulent in the whole dataset. At first glance, we can observe that less than 1% of the transactions are labeled as fraud.
| | Transaction Type | Percentage (%) | Count |
|---|---|---|---|
| 0 | Legitimate | 99.43 | 1042569 |
| 1 | Fraud | 0.57 | 6006 |
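The percentages in the table follow directly from the counts; for instance:

```python
# Counts taken from the table above.
counts = {"Legitimate": 1042569, "Fraud": 6006}
total = sum(counts.values())  # 1,048,575 transactions in total
share = {k: round(100 * v / total, 2) for k, v in counts.items()}
print(share)  # fraud is well under 1% of all transactions
```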
Isolation Forest with the package isotree
First, some of the hyperparameters that can be tuned: ntrees is the number of trees; max_depth is the maximum depth of the binary trees to grow (the standard value is used in this example); ndim is the number of columns to combine to produce a split; and scoring_metric is the metric used for determining outlier scores.
The Isolation Forest will produce an anomaly score ranging from 0 to 1, as explained in the Intuition and Fundamentals sections:
Close to 1: the point can be interpreted as an anomaly.
Around 0.5: the point cannot be distinguished from normal data.
Well below 0.5: the point can be interpreted as a regular point.
To analyze the performance of the algorithm, we will use a confusion matrix, focusing on precision over recall, since we want to minimize the number of normal customers wrongly categorized as having fraudulent transactions; otherwise the bank may lose valuable customers.
Before continuing to the confusion matrix, we will compare the anomaly scores against a specific threshold: if a point’s anomaly score is above the 80% quantile of the scores, we will classify the point as an anomaly. This threshold can be adjusted according to the business subject matter.
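The thresholding step can be sketched as follows (the scores are made-up toy values; `statistics.quantiles` with n=5 yields the 20/40/60/80% cut points, the last of which is the 80% quantile):

```python
from statistics import quantiles

# Hypothetical anomaly scores standing in for the forest's output.
scores = [0.29, 0.30, 0.31, 0.30, 0.32, 0.33, 0.52, 0.31, 0.30, 0.61]

threshold = quantiles(scores, n=5)[-1]                # 80% quantile of the scores
flags = [1 if s >= threshold else 0 for s in scores]  # 1 = flagged as anomaly
```

By construction, roughly the top 20% of scores end up flagged, regardless of how the scores are distributed.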
0% 80%
0.2905683 0.3334917
Now we add a new column labeled fraud_detection that is 1 (fraud) if the fraud_score is greater than or equal to the 80% quantile, or 0 (legitimate) if it is less than the 80% quantile.
Now we compute the confusion matrix:
The results below are:
True Negatives (TN): 920,599 → Correctly identified legitimate transactions.
False Positives (FP): 1,875 → Legitimate transactions misclassified as fraud.
False Negatives (FN): 121,970 → A huge number of missed fraud cases.
True Positives (TP): 4,131 → Correctly detected fraud cases.
The model has an accuracy of 88.2%, which seems high, but this value is inflated because the majority of the transactions are legitimate; accuracy alone is not a good measure on such an imbalanced dataset. The precision (positive predictive value for fraud) is 68.8%: when the model predicts fraud, it is correct 68.8% of the time. The balanced accuracy is 51.5%, just slightly better than random guessing (50%).
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 920599 121970
1 1875 4131
Accuracy : 0.8819
95% CI : (0.8813, 0.8825)
No Information Rate : 0.8797
P-Value [Acc > NIR] : 5.645e-12
Kappa : 0.0522
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.032759
Specificity : 0.997967
Pos Pred Value : 0.687812
Neg Pred Value : 0.883010
Prevalence : 0.120259
Detection Rate : 0.003940
Detection Prevalence : 0.005728
Balanced Accuracy : 0.515363
'Positive' Class : 1
Overall, this means the model is heavily biased toward legitimate transactions. It rarely catches fraud (low recall), meaning many fraudulent transactions go undetected. The relatively high precision means that when the model does predict fraud it is often right, but this comes at the cost of missing nearly all fraud cases.
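These figures can be recomputed directly from the counts in the printed confusion matrix above (positive class = fraud):

```python
# Counts read off the printed confusion matrix.
TN, FN = 920599, 121970   # predicted legitimate (reference 0, reference 1)
FP, TP = 1875, 4131       # predicted fraud (reference 0, reference 1)

accuracy = (TP + TN) / (TP + TN + FP + FN)      # ~0.8819, as reported
precision = TP / (TP + FP)                      # positive predictive value, ~0.6878
recall = TP / (TP + FN)                         # sensitivity, ~0.0328
specificity = TN / (TN + FP)                    # ~0.9980
balanced_accuracy = (recall + specificity) / 2  # ~0.5154
```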
In addition, we can plot an ROC curve (receiver operating characteristic curve) to show the performance of this classification model at different classification thresholds.

The ROC curve is well above the diagonal line, which means that the model has predictive power. At the beginning it rises steeply, indicating good early separation of fraud and legitimate transactions. However, it then flattens out, which means that further improvements come at the cost of more false positives. This ROC curve suggests an AUC (area under the curve) in the 0.80–0.90 range, which is decently strong.
To be sure, we calculate the AUC to evaluate the performance of our binary classification model. The calculation below shows an AUC of 0.871, which is a good result, since the closer the AUC is to 1, the better the prediction performance.
[1] 0.8707329
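The AUC has a simple rank-based (Mann–Whitney) interpretation: the probability that a randomly chosen fraud case receives a higher score than a randomly chosen legitimate one. A toy Python sketch of that computation (the scores and labels below are made up for illustration, not the report’s data):

```python
def auc(scores, labels):
    """Rank-based AUC: probability a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.4, 0.3, 0.2], [1, 0, 1, 0, 0]))
```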
Visualizing the Isolation Forest Score
A plot will be needed to visualize the scores produced by the Isolation Forest algorithm. The visualization will give us more insight into the polarization of the anomaly scores. Since visualization is better with fewer dimensions, we will compress the data into as few dimensions as possible using PCA; then, we will take the dimensions that explain most of the variation in the data and visualize them in a plot. I am choosing PCA because its main application is dimensionality reduction: reducing the number of variables (dimensions) in the dataset while retaining as much of the variance (information) as possible.

The first three dimensions retain the relative majority of the variation in the data, 62.1%.
Also, we will create a plot of the spread of the anomaly scores. Using the first three dimensions, we build the Isolation Forest and visualize the anomaly score.
d1 d2 d3
1 -1.8208001 -1.67721 -2.11088
2 -1.6279869 -1.67721 -2.11088
3 -1.4351737 -1.67721 -2.11088
4 -1.2423605 -1.67721 -2.11088
5 -1.0495473 -1.67721 -2.11088
6 -0.8567341 -1.67721 -2.11088